# Multimodal reasoning

## GLM-4.1V-9B-Thinking
**Publisher:** THUDM · **License:** MIT · **Tags:** Image-to-Text, Transformers, Supports Multiple Languages · **Stats:** 163 · 95

GLM-4.1V-9B-Thinking is an open-source vision-language model built on the GLM-4-9B-0414 foundation model. It focuses on improving reasoning in complex tasks and supports a 64k context length and 4K image resolution.
## Kimi-VL-A3B-Thinking-2506
**Publisher:** moonshotai · **License:** MIT · **Tags:** Image-to-Text, Transformers · **Stats:** 515 · 67

Kimi-VL-A3B-Thinking-2506 is an upgraded version of Kimi-VL-A3B-Thinking, with significant improvements in multimodal reasoning, visual perception and understanding, and video scene processing. It supports higher-resolution images and reasons more capably while consuming fewer tokens.
## Magistral-Small-2506-Vision
**Publisher:** OptimusePrime · **License:** Apache-2.0 · **Tags:** Image-to-Text, Safetensors, Supports Multiple Languages · **Stats:** 125 · 5

Magistral-Small-2506-Vision is a reasoning fine-tune of Mistral Small 3.1 trained with GRPO, released as an experimental checkpoint with vision capabilities.
## Stockmark-2-VL-100B-beta
**Publisher:** stockmark · **License:** Other · **Tags:** Image-to-Text, Transformers, Supports Multiple Languages · **Stats:** 184 · 8

Stockmark-2-VL-100B-beta is a Japanese-focused vision-language model with 100 billion parameters. It is equipped with chain-of-thought (CoT) reasoning and can be used for document reading and comprehension.
## InternVL3-8B
**Publisher:** unsloth · **License:** Apache-2.0 · **Tags:** Multimodal Alignment, Transformers · **Stats:** 224 · 1

InternVL3-8B is an advanced multimodal large language model with strong multimodal perception and reasoning capabilities, capable of processing multimodal data such as images and videos.
## InternVL3-1B GGUF
**Publisher:** unsloth · **License:** Apache-2.0 · **Tags:** Multimodal Fusion, Transformers · **Stats:** 868 · 2

InternVL3-1B is an advanced multimodal large language model that excels in multimodal perception and reasoning, and extends to further multimodal capabilities such as tool use and GUI agents.
## VisionReasoner-7B
**Publisher:** Ricky06662 · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, English · **Stats:** 2,398 · 1

VisionReasoner-7B is an image-text-to-text model with a decoupled architecture consisting of a reasoning model and a segmentation model. It can interpret user intentions and generate pixel-level masks.
## Qwen3-8B
**Publisher:** unsloth · **License:** Apache-2.0 · **Tags:** Large Language Model, Transformers · **Stats:** 30.23k · 5

Qwen3-8B is the latest large language model in the Qwen series. It offers a range of advanced features, supports multiple languages, and performs strongly in reasoning, instruction following, and related tasks, delivering a more capable and natural interaction experience.
## Synthia-S1-27b (bnb 4-bit)
**Publisher:** GusPuffy · **Tags:** Text-to-Image, Transformers · **Stats:** 858 · 1

Synthia-S1-27b is an advanced reasoning AI model developed by Tesslate AI, focusing on logical reasoning, coding, and role-playing tasks.
## Gemma 3 27B IT GGUF
**Publisher:** Mungert · **Tags:** Text-to-Image · **Stats:** 4,034 · 6

A GGUF-quantized version of Gemma 3 with 27B parameters, supporting image-text interaction tasks.
## Spec-Vision-V1
**Publisher:** SVECTOR-CORPORATION · **License:** MIT · **Tags:** Text-to-Image, Transformers, Other · **Stats:** 17 · 1

Spec-Vision-V1 is a lightweight, state-of-the-art open-source multimodal model designed for deep integration of visual and textual data, supporting a 128K context length.
## Mulberry-Qwen2VL-7B
**Publisher:** HuanjinYao · **License:** Apache-2.0 · **Tags:** Text-to-Image, Transformers · **Stats:** 13.57k · 1

Mulberry is a step-by-step reasoning model trained on the Mulberry-260K SFT dataset, which was generated through collective knowledge search.
## Meditron-7B LLM Radiology
**Publisher:** nitinaggarwal12 · **License:** Apache-2.0 · **Tags:** Large Language Model, Transformers · **Stats:** 26 · 1

An open-source model released under the Apache-2.0 license; no detailed description has been provided yet.
## DNABERT-S
**Publisher:** zhihan1996 · **License:** Apache-2.0 · **Tags:** Large Language Model, Transformers · **Stats:** 2,815 · 7

An open-source model released under the Apache-2.0 license; refer to the model documentation for its specific functionality.